1. Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
Each dataset has 20 variables, including site codes/names/lat/lon, county codes/names, larger CBSA codes/names, state codes/names. There is also a daily number of observations (1), the daily mean PM2.5 concentration, daily air quality index (AQI) and the date for each day of the year. 2002 has 15976 observation, 2022 has 57775 observations.
There are no missing site IDs or numbers outside that look misentered (all are 8 digits). Daily AQI and PM2.5 concentrations don’t appear to have any missing values or unreasonable values.
Min. 1st Qu. Median Mean 3rd Qu. Max.
60010007 60290014 60590007 60549600 60731002 61131003
summary(PM.2022$Site.ID)
Min. 1st Qu. Median Mean 3rd Qu. Max.
60010007 60311004 60631007 60571692 60771003 61131003
summary(PM.2002$DAILY_AQI_VALUE)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 29.00 50.00 53.68 69.00 176.00
summary(PM.2022$DAILY_AQI_VALUE)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 18.00 29.00 32.95 45.00 353.00
summary(PM.2002$Daily.Mean.PM2.5.Concentration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 7.00 12.00 16.12 20.50 104.30
summary(PM.2022$Daily.Mean.PM2.5.Concentration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.200 4.200 7.000 8.574 10.900 302.500
2. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
PM.All <-rbind(PM.2002,PM.2022)library(lubridate)
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
3. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
The sites are well distributed between inland and central valley, though there is a paucity of sampling stations in the northern part of the state and in the Mohave Desert/towards the AZ border.
library(leaflet)library(magrittr)# Generating a color paletteYear.Pal <-colorBin(c('red','darkblue'), domain=c(2002,2022))#Make a mapPM.Map <-leaflet(PM.All) %>%addProviderTiles('CartoDB.Positron') %>%addCircles(lat =~lat, lng=~lon,label = PM.All$Site.Name, color =~Year.Pal(c(2002,2022)),opacity =1, fillOpacity =1, radius =500) %>%addLegend('bottomleft', pal=Year.Pal, values =c(2002,2022),title='Year', opacity=1)PM.Map
4. Check for any missing or implausible values of PM2.5 in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
In the combined dataset, the minimum value for PM 2.5 is -2.2, which is not plausible. Negative values represent 0.028% of the combined data. The PM 2.5 >300 is extremely high compared to the rest of the dataset, however this value was found at Siskiyou County on July 31, 2022. According to CalFire, there was a fire in the adjacent Shasta county on 7/14/22, so we’ll keep this value.
By comparing the summary stats, we see that although 2022 had a higher maximum PM 2.5 level, 2022 actually had a lower median daily average PM 2.5 compared to 2002 (12 vs 7).
summary(PM.All$PM2.5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.20 4.50 7.70 10.21 12.40 302.50
length(subset(PM.All, PM2.5<0))/nrow(PM.All)*100
[1] 0.02847419
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
──────────────────────────────
Year
2002 2022
n = 15976 n = 57775
PM 2 5
16.1 (13.9) 8.6 (7.9)
──────────────────────────────
PM.All[which.max(PM.All$PM2.5),]
Date Source Site.ID POC PM2.5 UNITS AQI Site.Name
66524 2022-07-31 AQS 60932001 3 302.5 ug/m3 LC 353 Yreka
DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
66524 1 100 88101
AQS_PARAMETER_DESC CBSA_CODE CBSA_NAME STATE_CODE STATE
66524 PM2.5 - Local Conditions NA 6 California
COUNTY_CODE COUNTY lat lon Year
66524 93 Siskiyou 41.72689 -122.6336 2022
5. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
In CA, LA County, and Pasadena, there were higher PM2.5 levels in 2002 than in 2022. The distributions were more skewed in all of CA in 2022, but in Pasadena alone, there was less variation and skew in the PM levels in 2022 compared to 2002.
# CAboxplot(PM.All$PM2.5~ PM.All$Year) #skewed, let's use median
PM.All.Table <-table(PM.All$Year,PM.All$PM2.5)barplot(PM.All.Table, col =c("red","lightblue"))
tapply(PM.All$PM2.5, PM.All$Year, summary)
$`2002`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 7.00 12.00 16.12 20.50 104.30
$`2022`
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.200 4.200 7.000 8.574 10.900 302.500
PM.LAC.Table <-table(PM.LAC$Year,PM.LAC$PM2.5)barplot(PM.LAC.Table, col =c("red","lightblue"))
tapply(PM.LAC$PM2.5, PM.LAC$Year, summary)
$`2002`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.60 11.10 17.40 19.66 25.50 72.40
$`2022`
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.20 7.40 10.40 11.04 13.80 56.00
PM.PAS.Table <-table(PM.PAS$Year,PM.PAS$PM2.5)barplot(PM.PAS.Table, col =c("red","lightblue"))
tapply(PM.PAS$PM2.5, PM.PAS$Year, summary)
$`2002`
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 12.00 17.80 20.29 25.20 57.80
$`2022`
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.500 6.400 7.900 9.094 11.575 22.100